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Abstract 

Recent approaches to distributed model fitting 
rely heavily on consensus ADMM, where each 
node solves small sub-problems using only lo¬ 
cal data. We propose iterative methods that solve 
global sub-problems over an entire distributed 
dataset. This is possible using transpose re¬ 
duction strategies that allow a single node to 
solve least-squares over massive datasets with¬ 
out putting all the data in one place. This results 
in simple iterative methods that avoid the expen¬ 
sive inner loops required for consensus methods. 

To demonstrate the efficiency of this approach, 
we fit linear classifiers and sparse linear models 
to datasets over 5 Tb in size using a distributed 
implementation with over 7000 cores in far less 
time than previous approaches. 

1. Introduction 

We study optimization routines for problems of the form 

minimize f{Dx), (1) 

where D G is a (large) data matrix and / is a convex 

function. We are particularly interested in the case that £> is 
stored in a distributed way across JV nodes of a network or 
cluster. In this case, the matrix D — {Dj, • • • , 

is a vertical stack of sub-matrices, each of which is stored 
on a node. If the function / decomposes across nodes as 
well, then problem (1) takes the form 

minimize E MD,x), (2) 

i<N 

where the summation is over the N nodes. Problems of this 
form include logistic regression, support vector machines. 


lasso, and virtually all generalized linear models (Hardin & 
Hilbe, 2001). 

Most distributed solvers for equation (2), such as ADMM, 
are built on the assumption that one cannot solve global op¬ 
timization problems involving the entire matrix D. Rather, 
each node alternates between solving small sub-problems 
involving only local data, and then exchanging information 
with other nodes. 

This work considers methods that solve global optimiza¬ 
tion problems over the entire distributed dataset on each 
iteration. This is possible using transpose reduction meth¬ 
ods. Such schemes exploit the following simple observa¬ 
tion; when D has many more rows than columns, the ma¬ 
trix D is considerably smaller than D. The availability 
of D^D enables a single node to solve least-squares prob¬ 
lems involving the entire data matrix D. Furthermore, in 
many applications it is possible (and efficient) to compute 
D^D in a distributed way. This allows our approach to 
solve extremely large optimization problems much faster 
than the current state-of-the art. We support this conclusion 
both with theoretical results in Section 8 and with experi¬ 
mental results in Section 10. 

2. Related Work 

Simple first-order methods (methods related to gradient de¬ 
scent) have been available for smooth minimization prob¬ 
lems for decades (Tsitsiklis et al., 1986; Rabbat & Nowak, 
2004). However, slow performance of gradient schemes 
for poorly conditioned problems and slow (sublinear) con¬ 
vergence of sub-gradient and stochastic methods for non- 
differentiable problems has led many data scientists to con¬ 
sider splitting schemes. 

Much recent work on solvers for formulation (2) has fo¬ 
cused on the Alternating Direction Method of Multipliers 
(Glowinski & Marroco, 1975; Glowinski & Tallec, 1989; 
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Goldstein & Osher, 2009), which has become a staple of 
the distributed computing and image processing literature. 
The authors of (Boyd et al., 2010) propose using ADMM 
for distributed model fitting using the “consensus” formu¬ 
lation. Consensus ADMM has additionally been studied 
for distributed model fitting (Erseghe et al., 2011), sup¬ 
port vector machines (Forero et al., 2010), and numerous 
domain-specific applications (Chen et al., 2008; He et al., 
2012). Many variations of ADMM have subsequently been 
proposed, including specialized variants for decentralized 
systems (Mota et al., 2013), asynchronous updates (Zhang 
& Kwok, 2014; Ouyang et al., 2013), inexact solutions to 
subproblems (Chang et al., 2015), and online/stochastic up¬ 
dates (Ouyang et al., 2013). 

3. Background 

The alternating direction method of multipliers (ADMM) 
is a general method for solving the problem 

minimize g{x) + h{y) 
subject to Ax -\- By = 0. 

The ADMM enables each term of problem (3) to be ad¬ 
dressed separately. The algorithm in its simplest form be¬ 
gins with estimated solutions z°, and a Lagrange mul¬ 
tiplier A°. ADMM then generates the following iterates; 

= sxgmin g{x) +''-\\Ax + By^ + 

X ^ 

yfc+i =argmin/r(y)-f + (4) 

V 2 

Afc+1 = -f By^+^ 

with some positive stepsize parameter r > 0. 

Disparate formulations are achieved by different A, B, f, 
and g. For example, consensus ADMM (Boyd et al., 2010) 
addresses the problem 

minimize E Mx^) 

* (5) 

subject to Xi = z for all i 

which corresponds to problem (3) with B = {I, I, ■ ■ ■ I)"’", 
A = I, f{x) = fi{xi), and g = 0. Rather than solv¬ 
ing a single global problem, the consensus ADMM solves 
many small problems in parallel. On each iteration of con¬ 
sensus ADMM, node i performs the update 

x^+^ = argmin fi{xi) + l-\\xi - zip. (6) 

Xi ^ 

The shared global variable z is then updated by the central 
server. Finally, by updating the Lagrange multipliers {Ai}, 
the solutions to each sub-problem are forced to be progres¬ 
sively more similar on each iteration. 


4. Transpose Reduction Made Easy: 
Regularized Least-Squares 

Transpose reduction is most easily described for regular¬ 
ized least-squares problems. Consider the distributed solu¬ 
tion of the problem 

minimize J{x)+ ^\\Dx — h\\‘^ (7) 

for some penalty term J. When J{x) = yL\x\ for some 
scalar p, this becomes the lasso regression (Tibshirani, 
1994). Typical consensus solvers for problem (7) require 
each node to compute the solution to equation (6), which is 
here given by 

= argmin^llAa^i - 6j|P + l:\\xi - z'^lp 

= {DfD. + Tiy^Dfn + Tz). 

During the setup phase for consensus ADMM, each node 
forms the matrix Df Di , and then computes and caches the 
inverse (or equivalently the factorization) of {Df Di + rl). 

Alternatively, transpose reduction can solve distributed 
ADMM on a single machine without moving the entire ma¬ 
trix D to one place. Using the simple identity 

]^\\Dx - b\f =^{Dx -b,Dx- b) 

= ]^x'^[D'^D)x - x'^D^b +]^\\b\\^ 

we can replace problem (7) with the equivalent problem 

minimize J{x) + -x^ {D"^ D)x — (8) 

To solve problem (8), the central server needs only the ma¬ 
trix D^D and the vector 79^6. When 79 is a large “tall” ma¬ 
trix 79 S with n m, D has only n? (rather 

than nm) entries, which is practical to store on a single 
server. Furthermore, because 

D^D = J2DfD,, and D^b = '^Dfbi 

i i 

the matrix D^D can be formed by having each server com¬ 
pute is own local Df Di, and then aggregating/reducing the 
results on a central server. 

Once D^D and D"^b have been computed in the cloud and 
cached on a central server, the global problem is solved 
on a single node. This is done using either a single-node 
ADMM method for small dense lasso (see (Boyd et al., 
2010) Section 6.4) or a forward-backward (proximal) split¬ 
ting method (Goldstein et al., 2014b). The latter approach 
only requires the gradient of 4a;^(79^79)x — x^79^5, 
which is given by D^Dx — D^b. 
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5. Unwrapping ADMM: Transpose Reduction 
for General Problems 


Transpose reduction can be exploited for more complex 
problems using ADMM. On each iteration of the proposed 
method, least-squares problems are solved over the entire 
distributed dataset. In contrast, each step of consensus 
ADMM relies on sub-problems involving only small sub¬ 
sets of the data. 


We aim to solve problem (1). We begin by “unwrapping” 
the matrix D; we remove it from / using the formulation 


minimize f{y) 
subject to y = Dx. 


(9) 


Applying the ADMM with A = B = —I, h = f, and 
g = 0 yields Algorithm 1 . 


Algorithm 1 Unwrapped ADMM 
1: Choose initial x°, y°, A°, and r > 0 
2: while not converged do 

3: = argmin^ ||Z3x —j/^-|-A^|P = D^{y^ — \^) 

4 : = argmiUj^ f{y) + — y -f A^|P 

5: A'=+i = \^ + Dx^+^ - 

6: end while 


We will now examine the steps in Algorithm 1. The x up¬ 
date requires the solution of a global least squares problem 
over the entire dataset. The solution requires the pseudoin¬ 
verse of D, given by 

D+ = {d'^D)-^D^. 

The y update can be represented using the proximal map¬ 
ping of /, which is given by 

proxj:{z,S) = argmin/(y) -f ^\\y - zf. 

Using this notation. Line 4 of Algorithm 1 is written 
proxj:{Dx'^~^^ + A^,t“^). Provided / is decomposable, 
the minimization in Line 4 is coordinate-wise decoupled. 
Each coordinate of is computed with either an analyt¬ 
ical solution, or using a simple 1-dimensional lookup table 
of pre-computed solutions. 

5.1. Distributed Implementation 


update in Algorithm 1 becomes 

x’^+^ = D+{y>^ - X’^) = lU^ A(yf - A,"), (10) 

i 

where W = Dj Di)~^ . Each vector — Aj^) can 

be computed locally on node i. Multiplication by W must 
take place on a central server. 

Note the massive dimensionality reduction that takes place 
when D = formed. For a data matrix 

D e with n ^ m, the Gram matrix D'^ D G 

is small, even in the case that D is far too large to store 
on the central server. The complete distributed method is 
listed in Algorithm 2. 


Algorithm 2 Unwrapped ADMM with Transpose Reduc¬ 
tion_ 

1: Choose global a;°, r, and local {y°}, {A?} on each 
node i 

2: All nodes: Wi = Df Di, Vi 
3: Central node: W = 

4: while not converged do 

5: All nodes: df = Df (yf — Af), Vi 

6: Central Node: x^~^^ = 

7: All nodes: 

= argmin^. fi{yi) + f ||- y» + Aj’p 
= prox^. -f Af,T“^),Vi 

8: All nodes: X’l^^ = X’^ + DiX^+^ - yf+^ 

9: end while 


6. Application: Fitting linear classifiers 

6.1. Logistic regression 

A logistic classiher maps a feature vector d to probabilities 
using a mapping of the form d i—some row 
vector X of weights. Given a matrix D G R™xn 
ture vectors, and a vector I G R"* of labels, the regression 
weights are found by solving 

m 

minimize E log(l -f exp(-lk(Dkx))) 

fc=i 

= flr{Dx), (11) 

where fir is the logistic regression loss (negative log- 
likelihood) function 


When D = {Df, Df, • • • , Df)"’" is large and distributed 
over N nodes, no node has access to the entire matrix D 
to solve the global least squares problem for This is 

where we exploit the transpose reduction. 

We can decompose the rows of y = {yf, yf, • • • , yf)’’" 
and A = {Xf, Xf, ■ ■ ■ , Xf)"’". The constraint in formula¬ 
tion (9) now becomes y^ = DiX, and the least-squares x 


flr{z) = ^ log(l + exp(-/fcZfc)). 
k^l 

To apply unwrapped ADMM, we must evaluate the proxi¬ 
mal operator 

WOXf^^{z,T~'’) = argminfiriz) -f ^||y- 
y ^ 
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This is a convex, 1-dimensional optimization problem, and 
the solution is easily computed with a few iterations of 
Newton’s method. However, because the proximal opera¬ 
tor depends on only one input variable, we suggest forming 
a lookup table of pre-computed solutions. 

6.2. Support Vector Machine 

A common formulation of the support vector machine 
(SVM) solves 


and then minimizing f{bx). In the distributed setting 
where / and D are computed/stored across N servers, the 
problem now has the form (2) with 

minimize E MD^x) (15) 

i<N+l 

where fN+i{z) = and D^+i = I. 

7.1. Splitting Over Columns 


minimize + Ch{Dx) (12) 

where C is a regularization parameter, and h is a simple 
“hinge loss” function given by 


When the matrix D is extremely wide (m ^ n), it often 
happens that each server stores a subset of columns of D 
rather than rows. Fortunately, such problems can be han¬ 
dled by solving the dual of the original problem. The dual 
of the sparse problem (14) is given by 


M 

k^l 

The proximal mapping of h has the form 

prox;j(z, 5)k = Zk + h max{min{l - lkZk,d},0}. 

Using this proximal operator, the solution to the y update 
in Algorithm 1 is simply 


minimize /*(«) subject to \\D'^a\\ac> ^ (16) 

OL 

where f* denotes the Fenchel conjugate (Boyd & Vanden- 
berghe, 2004) of /. For example the dual of the lasso prob¬ 
lem is simply 

minimize x||q; + 6|P subject to ||£>^a||oo < 

a 2 

Problem (16) then reduces to the form (1) with 


= prox^ (^Dx'^+^ + A^ . 

Note that this algorithm is much simpler than the consensus 
implementation of SVM, which requires each node to solve 
the sub-problem 

minimize Ch{Dx) +'^\\x — y\\‘^. (13) 



l\\zk + bkW^, for 1 < fc < TO 
X{zk), for Ac > TO 


where A'(z) is the characteristic function of the £oo ball 
of radius /r. The function A'(z) is infinite when \z^\ > /i 
for some A, and zero otherwise. The unwrapped ADMM 
for this problem requires the formation of DiDj on each 
server, rather than Df Di. 


Despite the similarity of this problem to the original SVM 
(12), this problem form is not supported by available SVM 
solvers such as LIBSVM (Chang & Lin, 2011) and others, 
and thus requires a custom solver (see Section A in the Ap¬ 
pendix). 


7. Handling Sparsity 

Variable-selection methods use £i regularization to obtain 
sparse solutions. For high dimensional problems, reg¬ 
ularization is commonplace to avoid overfitting. Sparse 
model fitting problems have the form 

minimize y\x\ -f f{Dx) (14) 


for some regularization parameter y > 0. Sparse problems 
can be trivially reduced to the form (1) by defining 


b = 



fiz)k = 


y\zk\, for 1 < k < n 
fkizk), for Ac > n 


8. Convergence Theory 

The convergence of ADMM is in general well understood. 
Classical results guarantee convergence using monotone 
operator theory (Eckstein & Bertsekas, 1992), however 
they do not provide convergence rates. A more recent re¬ 
sult due to He and Yuan (He & Yuan, 2012), provides a 
global convergence rate for the method. Theorem 1 below 
presents this result, as adapted to the problem form (3). 

Theorem 1. (He and Yuan) Consider the iterates gener¬ 
ated by the ADMM method 4. If f and g are proper convex 
functions, then 

my^+^ - yb\? + \\Ax'^+^ + By^+^\\^ 

^ Hg(A/°-AA*)ll^ + l|A°-Alp 

“ Ac -f 1 

where y* is an optimal solution for (3) and \* is the corre¬ 
sponding optimal Lagrange multiplier. 
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It was observed in (Boyd et al., 2010) that \\Ax^^^ + 
j^yk+i \^\2 \\A^— y^)\\^ are measures of pri¬ 

mal and dual infeasibility, respectively. Thus, Theorem 1 
guarantees the iterates of ADMM approach primal and dual 
feasibility (and thus optimality) as fc —oo, and that the 
infeasibility errors decrease with rate 0{\/k) in the worst 
case. We can specialize this result to the case of unwrapped 
ADMM to obtain Corollary 1. 

Corollary 1. If formulation (1) is feasible, then the iterates 
of the unwrapped ADMM 1 satisfy 

||yfe + l-yfe||2 + ||^^fc-tl_yfc+l|j2 

^ \\yO-Dx*r + \\X°-Xf\^ 

~ k + 1 

where y* is an optimal solution for (3) and A* is the corre¬ 
sponding optimal Lagrange multiplier. 

While Theorem 1 and Corollary 1 show convergence in the 
sense that primal and dual feasibility are attained for large 
k, it seems more natural to ask for a direct measure of opti¬ 
mality with respect to the objective function (1). In the case 
that / is differentiable, we can exploit the simple form of 
unwrapped ADMM and show the derivative of ( 1 ) goes to 
zero, thus directly showing that is a good approximate 
minimizer of (1) for large k. 

Theorem 2. If the gradient of f exists and has Lipschitz 
constant L{\If), then Algorithm 1 shrinks the gradient of 
the objective function (1) with global rate 

\\^{f{Dx^)}\\^ = \\D^^f[Dx^)}\\^ 

\\y^-Dx*r + \\\^-Mr 


We now have 


d,{f{Dx^)} =D^Vf{Dx'^) = + Dx^ - y^) 

=D^^f{y^ + Dx’^ - y^) 

-D'^^f{y^) + TD^{y^ -y^+^) 

and so \\dx{f{Dx’‘)}\\ is bounded above by 

< L{^f)\\D^\\op\\Dx'‘ - y'^ll + rp^lUpIl/ - /■+i||. 

where ||iA^||op denotes the operator norm of . 

Now, by Corollary (1), we know that both 
||iAx^ — y^ll and \\y^ — are bounded above 

by Applying this bound to (8) 

yields 


l!a.{/(^a;'=)}|| < 



Dx*\\^ + ||A0 
k 



We obtain the result by squaring this inequality and noting 
that\\D\\lp = piD^D). □ 


Note the logistic regression problem (11) satisfies the con¬ 
ditions of Theorem 2 with L(V f) = 1/4. 

Finally, we remark that better convergence rates are pos¬ 
sible in special cases. For example when / is strongly 
convex, accelerated ADMM obtains 0(1/k^) convergence 
(Goldstein et al., 2014a), and if we further assume D has 
full row rank, R-linear convergence is possible (Deng & 
Yin, 2012). 


where C = f) -f rf^p{D^D) is a constant and 

p{D"^D) denotes the spectral radius of D. 

Proof We begin by writing the optimality condition for the 
x-update in Algorithm 1 : 

D'^{Dx’^+^ -y^ + A'=) 

= - /) = 0. (18) 

Note we used the definition = A^ -f Dx^^^ — y^^^ 
to simplify (18). Similarly, the optimality condition for the 
y-update yields 

- Dx’^^^ - A'^) 

= V/(2/'=+i) - rA'^+i = 0, 

or equivalently f{y^^^) = tA^’*'^. Combining this with 
equation (18) yields 

D'^^f{y^) = TD^{y’^ -y’^+^). 


9 . Implementation Details 

We compare transpose reduction methods to consensus 
ADMM using both synthetic and empirical data. We study 
the transpose reduction scheme for lasso (Section 4) in ad¬ 
dition to the unwrapped ADMM (Algorithm 2) for logistic 
regression and SVM. 

We built a distributed implemention of the transpose reduc¬ 
tion methods along with consensus optimization, and ran 
all experiments on Armstrong, a 30,000-core Cray XC30 
hosted by the DOD Supercomputing Resource Center. This 
allow us to study experiments ranging in size from very 
small to extremely large. 

All distributed methods were implemented using MPI. 
Stopping conditions for both transpose reduction and con¬ 
sensus methods were set using the residuals defined in 
(Boyd et al., 2010) with e^e/ = 10“^, and Cats = 10“®. 

Many steps were taken to achieve top performance of the 
consensus optimization routine. The authors of (Boyd 
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et al., 2010) suggest using a stepsize parameter r = 1; how¬ 
ever, substantially better performance is achieved by tuning 
this parameter. In our experiments, we tuned the stepsize 
parameter to achieve convergence in a minimal number of 
iterations on a problem instance with m = 10,000 data 
vectors and n — 100 features per vector, and then scaled 
the stepsize parameter up/down to be proportional to m. It 
was found that this scaling made the number of iterations 
nearly invariant to n and m. The iterative solvers for each 
local logistic regression/SVM problem were warm-started 
using solutions from the previous iteration. 

The logistic regression subproblems were solved using a 
limited memory BFGS method (with warm start to accel¬ 
erate performance). The transpose reduced lasso method 
(Section 4 ) requires a sparse least-squares method to solve 
the entire lasso problem on a single node. This was accom¬ 
plished using the forward-backward splitting implementa¬ 
tion PASTA (Goldstein et al., 2014b; 2015). 

Note that the SVM problem (13) cannot be solved by con¬ 
ventional SVM solvers (despite its similarity to the clas¬ 
sical SVM formulation). For this reason, we built a cus¬ 
tom solver using the same coordinate descent techniques as 
the well-known solver LIBSVM (Chang & Lin, 201 1). By 
using warm starts and exploiting the structure of problem 
(13), our custom method solves the problem (13) dramat¬ 
ically faster than standard solvers for problem (12). See 
Appendix A for details. 

10. Numerical Experiments 

To illustrate the behavior of transpose reduction methods 
on various data corpus sizes and multiple optimization 
problems, we applied consensus and transpose solvers to 
both synthetic and empirical datasets. We recorded both the 
total compute time and the wallclock time. We define total 
compute time as the total amount of time computing cores 
spend performing calculations. Total compute time does 
not include communication and extra diagnostic computa¬ 
tion not necessary for optimization (such as per-iteration 
computation of the objective function); wall time includes 
all calculation and communication. 

10.1. Synthetic Data 

Synthetic data was constructed as follows: 

Lasso problems We use the same synthetic problems used 
to study consensus ADMM in (Boyd et al., 2010). The 
data matrix D is a random Gaussian matrix. The true so¬ 
lution Xtrue contains 10 active features with unit magni¬ 
tude, and the remaining entries are zero. The £i penalty 
pL is chosen as suggested in (Boyd et al., 2010) — i.e., the 
penalty is 10% the magnitude of the penalty for which the 
solution to (7) becomes zero. The observation vector is 


b = Dxtrue + 7: where p is a standard Gaussian noise vec¬ 
tor with (7 = 1. 

Classification problems We generated two random Gaus¬ 
sian matrices, one for each class. The first class consists 
of zero-mean Gaussian entries. The first 5 columns of the 
second matrix were random Gaussian with mean 1, and the 
remaining columns were mean zero. Note the data classes 
generated by this process are not perfectly linearly separa¬ 
ble. The £\ penalty was set using the “10%” rule used in 
(Boyd et al., 2010). 

In addition to these “standard” test problem, we also study 
the important (and often more realistic) case where data is 
heterogeneous across nodes. 

Heterogeneous Data To create data heterogeneity, we 
chose one random Gaussian scalar for each node, and 
added it to the matrix Di. This has the effect of making the 
behavior of the sub-problems on each node different. Note 
that, with standard Gaussian test problems, each consen¬ 
sus node is essentially solving the same problem (i.e., data 
matrices have perfectly identical statistical properties), and 
thus they arrive at a consensus quickly. In practical appli¬ 
cations, data on different nodes may represent data from 
different sources, or else not be identically distributed. The 
behavior of consensus in such situations is radically dif¬ 
ferent than when the data is identically distributed. The 
heterogeneous synthetic data models this scenario. 

We performed experiments using logistic regression, SVM, 
and Lasso. On different trial runs, we vary the number of 
cores used, the length of the feature vectors, and the num¬ 
ber of data points per core. Three examples of our results 
are illustrated in Figure 1, while more complete tables of 
results appear Appendix B. In addition, convergence curves 
for experiments on both homogeneous and heterogeneous 
data can be seen in Figures 2a and 2b. 

10.2. Empirical Case Study: Classifying Guide Stars 

We study the behavior of transpose reduction and con¬ 
sensus methods using the Second Generation Guide Star 
Catalog (GCS-II) (Lasker et al., 2008), an astronomical 
database containing 950 million stars and other astronom¬ 
ical objects. Intensity measurements in different spectral 
bands are recorded for each object, in addition to geomet¬ 
ric properties such as size and eccentricity. The GSC-II 
also contains a binary classification that labels every astro¬ 
nomical body as “star” or “not a star.” We train a sparse 
logistic classifier to discern this classification using only 
spectral and geometric features. 

A data matrix was compiled by selecting the 17 spectral 
and geometric measurements reported in the catalog, and 
also “interaction features” made of all second-order prod¬ 
ucts of the 17 measurements; all features were then nor¬ 
malized. After the addition of a bias feature, the resulting 
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(a) Logistic regression with homoge¬ 
neous data. Experiments used 50,000 
data points of 2,000 features each per 
computing core. 



(b) SVM with homogeneous data. Ex¬ 
periments used 50,000 data points of 100 
features each per computing core. The 
longest timed run of Transpose ADMM 
took about 11 minutes of total comput¬ 
ing time. 


Amount of Data <GB) 

250 500 



(c) Lasso with heterogeneous data. 

Experiments used 50,000 data points of 
200 features each per computing core. 


Figure 1. Selected results from timing Consensus ADMM (green) and Transpose ADMM (blue) when solving three different opti¬ 
mization problems with varying data corpus sizes. In each experiment, every core was given an identically-sized subset of the data, so 
data corpus size and number of cores are related linearly. The top horizontal axis denotes the total data corpus size, while the bottom 
horizontal axis denotes the number of computing cores used to process the data. The vertical axis denotes total computing time. 


matrix has 307 features per object, and requires 1.8 TB of 
space to store in binary form. 

We ran experiments observing the decrease of the global 
objective function as a function of wallclock time; these 
experiments showed that transpose ADMM converged far 
more quickly than consensus. Convergence curves for this 
experiment are shown in Figure 2c. 

We also experiment with the effect of storing the data ma¬ 
trix across different numbers of cores (i.e. increasing the 
number of nodes). These experiments illustrated that this 
variable had little effect on the relative performance of the 
two optimization strategies; transpose methods remained 
far more efficient regardless of the load on individual cores. 
Table 1 reports runtimes with varying numbers of cores. 

Note the strong advantage of transpose reduction over con¬ 
sensus ADMM in Figure 2c. This confirms the results of 
Section 10.1, where it was observed that transpose reduc¬ 
tion methods are particularly powerful for the heteroge¬ 
neous data observed in realistic datasets, as opposed to the 
identically distributed matrices used in many synthetic ex¬ 
periments. 

11. Discussion 

In all experiments, transpose reduction methods required 
substantially less computation time than consensus meth¬ 
ods. As seen in Figure 1, Figure 2, and Appendix B, per¬ 
formance of both transpose reduction and consensus meth¬ 
ods scales nearly linearly with the number of cores and 
amount of data used for classification problems. For the 
lasso problem (Figure Ic), the runtime of the transpose re¬ 
duction method appears to grows sub-linearly with num¬ 
ber of cores, in contrast to consensus methods that have a 


Cores 

T-Wall 

T-Comp 

C-Wall 

C-Comp 

2500 

0;01;06 

11:35:25 

0:24:39 

31d,19:59:13 

3000 

0:00:49 

12:10:33 

0:21:43 

32d, 2:44:11 

3500 

0:00:50 

12:17:27 

0:17:01 

30d, 7:56:19 

4000 

0:00:45 

12:38:24 

0:29:53 

40d, 13:38:19 


Table 1. Wall clock and Total compute times for the logistic re¬ 
gression problem on the Second Generation Guide Star Cata¬ 
log. Experiments show different numbers of cores used for 
the same 1.8TB dataset. Columns labeled with “T-” denote re¬ 
sults for transpose reduction ADMM, and the label “C-” de¬ 
notes consensus ADMM. Times are reported in the format (days,) 
hours minutes: seconds. 


strong dependence on this parameter. This is largely be¬ 
cause the transpose reduction method does all iteration on 
a single machine, whereas consensus optimization requires 
progressively more communication with larger numbers of 
cores. 

Note that transpose reduction methods require more startup 
time for some problems than consensus methods because 
the local Gram matrices Df Di must be sent to the central 
node, aggregated, and the result inverted; this is not true 
for the lasso problem, for which consensus solvers must 
also invert a local Gram matrix on each node, though this at 
least saves startup communication costs. This startup time 
is particularly noticeable when overall solve time is short, 
as in Figure 2b. Note, however, that even for this problem 
total computation time and wall time was still shorter with 
transpose reduction than with consensus methods. 
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(a) Homogeneous data. Experiments 
used 7,200 computing cores with 50,000 
data points of 2,000 features each. 



(b) Heterogeneous data. Experiments 
used 7,200 computing cores with 50,000 
data points of 2,000 features each. 



(c) Empirical star data. Experiments 
used 2,500 cores to classify a data set of 
1.8TB. Consensus ADMM did not termi¬ 
nate until 1160 seconds. 


Figure 2. Value of the logistic regression objective function as a function of wallclock time for Consensus ADMM (green) and Trans¬ 
pose ADMM (blue). 


11.1. Effect of Heterogeneous Data 

An important property of the proposed transpose reduction 
methods is that they solve global subproblems over the en¬ 
tire data corpus (as opposed to consensus methods where 
sub-problems involve a small portion of the distributed 
data). The power of solving global problems is very appar¬ 
ent when data is heterogeneous across nodes. Heterogene¬ 
ity means that data on different nodes has different statis¬ 
tical properties (as opposed to homogeneous experiments 
where every data matrix has identical properties). 

Heterogeneity has a strong effect on consensus methods. 
When data is heterogeneous across nodes it causes local 
sub-problems to differ, and thus the nodes have a stronger 
tendency to “disagree” on the solution, taking longer to 
reach a consensus. This effect is illustrated by a compari¬ 
son between Figures 2a and 2b, where consensus methods 
took much longer to converge on heterogeneous data sets, 
while the transpose method was not affected. 

In contrast, because transpose reduction solves global sub¬ 
problems across the entire distributed data corpus, it is 
relatively insensitive to data heterogeneity across nodes. 
Data heterogeneity explains the strong advantage of trans¬ 
pose reduction on the GSC-II dataset (Figure 2c, Table 1), 
which contains empirical data and is thus not uniformly dis¬ 
tributed. 

11.2. Communication & Computation 

The transpose reduction ADMM leverages a tradeoff be¬ 
tween communication and computation. When N nodes 
are used to solve a problem with a distributed data ma¬ 
trix D e each node in consensus ADMM trans¬ 

mits Xi G M" to the central server, which totals to 0{Nn) 
communication. Unwrapped ADMM requires 0(m) com¬ 
munication per iteration, which is often somewhat more. 
Despite this difference, transpose reduction methods are 


still highly efficient because (a) they require dramatically 
less computation, and (b) communication with the server is 
substantially more efficient, even when more information 
is sent. We discuss these two issues in detail below. 

First, unwrapped ADMM requires substantially less com¬ 
putation than consensus methods. Consensus ADMM re¬ 
quires inner iterations to solve expensive sub-problems on 
each node. Because logistic regression and SVM can be 
very expensive for large data sets, consensus optimization 
is often CPU bound rather than communication bound. 
In contrast, the sub-problems of unwrapped ADMM are 
coordinate-wise separable and generally available in closed 
form. 

Second, transpose reduction methods stay synchronized 
better than consensus ADMM, which makes communica¬ 
tion more efficient on synchronous architectures. The it¬ 
erative methods used by consensus ADMM for logistic re¬ 
gression and SVM sub-problems do not terminate at the 
same time on every machine, especially when the data is 
heterogeneous across nodes. Because all client nodes must 
complete their inner iterations before an iteration is com¬ 
plete, most of the communication overhead of ADMM is 
caused by calls to MPI routines that block until all nodes 
become synchronized. In contrast. Algorithm 2 requires 
the same computations on each server, allowing nodes to 
stay synchronized naturally. 

12. Conclusion 

We introduce transpose reduction ADMM — an iter¬ 
ative method that solves model fitting problems using 
global least-squares subproblems over an entire distributed 
dataset. Numerical experiments using both synthetic and 
empirical data suggest that the transpose reduction is sub¬ 
stantially more efficient than consensus methods. When 
the distributed dataset has heterogeneous properties across 
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nodes, the global properties of transpose reduction are 
particularly advantageous compared to consensus meth¬ 
ods, which solve many small sub-problems using different 
datasets. 
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Appendices 


A. SVM Sub-steps for Consensus Optimization 

The consensus SVM requires a solution to the problem (13). Despise the apparent similarity of this proximal-regularized 
problem to the original SVM (12), problem (13) cannot be put into a form that is solvable to popular solvers for (12). 
However, techniques for the classical SVM problem can be easily adapted to solve (13). 

A common numerical approach to solving (12) is the attack its dual, which is 


minimize 

aiG[0,C] 


- a^l = '^aiajkljAiAj 



(19) 


Once (19) is solved to obtain a*, the solution to (12) is simply given by w* = The dual formulation (19) is 

advantageous because the constraints on a act separately on each coordinate. The dual is therefore solved efficiently by 
coordinate descent, which is the approach used by the popular solver LIBSVM (Chang & Lin, 2011). This method is 
particularly powerful when the number of support vectors in the solution is small, in which case most of the entries in a 
assume the value 0 or C. 

In the context of consensus ADMM, we must solve 

minimize ^\\w\\'^ + Ch(Aw,l) +'’^\\w — . (20) 

Following the classical SVM literature, we dualize this problem to obtain 

minimize -\\A'^La\\'^ — {{1 + t)1 — tLz) . (21) 

ctiG[0,C] 2 

We then solve (21) for a*, and recover the solution via 

. La + TZ 

w = -:-. 


We solve (21) using a dual coordinate descent method inspired by (Chang & Lin, 2011). The implementation has 0{M) 
complexity per iteration. Also following (Chang & Lin, 2011) we optimize the convergence by updating coordinates with 
the largest residual (derivative) on each pass. 

Because our solver does not need to handle a “bias” variable (in consensus optimization, only the central server treats the 
bias variables differently from other unknowns), and by using a warm start to accelerate solve time across iterations, our 
coordinate descent method significantly outperforms even LIBSVM for each sub-problem. On a desktop computer with a 
Core i5 processor, LIBSVM solves the synthetic data test problem with m = 100 datapoints and n = 200 features in 3.4 
seconds (excluding “setup” time), as opposed to our custom solver which solves each SVM sub-problem for the consensus 
SVM with the same dimensions (on a single processor) in 0.17 seconds (averaged over all iterations). When m = 10000 
and n = 20, LIBSVM requires over 20 seconds, while the average solve time for the custom solver embedded in the 
consensus method is only 2.3 seconds. 


B. Tables of Results 

In the following tables, we use these labels: 

• N: Number of data points per core 

• F: Number of features per data point 

• Cores: Number of compute cores used in computation 

• Space: Total size of data corpus in GB (truncated at GB) 
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• TWalltime: Walltime for transpose method (truncated at seconds) 

• TCompute: Total computation time for transpose method (truncated at seconds) 

• CWalltime: Walltime for consensus method (truncated at seconds) 

• CCompute: Total computation time for consensus method (truncated at seconds) 


Logistic regression with homogeneous data 


N 

F 

Cores 

Space(GB) 

TWalltime 

TCompute 

CWalltime 

CCompute 

50000 

2000 

800 

596 

0:00:53 

6:19:14 

0:01:36 

17:25:18 

50000 

2000 

1600 

1192 

0:00:58 

12:40:24 

0:01:51 

1 day 10:51:33 

50000 

2000 

2400 

1788 

0:01:00 

19:05:13 

0:01:52 

2 days 4:21:25 

50000 

2000 

3200 

2384 

0:01:00 

1 day 1:30:18 

0:01:41 

2 days 21:46:28 

50000 

2000 

4000 

2980 

0:00:58 

1 day 7:58:24 

0:01:39 

3 days 15:17:51 

50000 

2000 

4800 

3576 

0:00:58 

1 day 14:27:31 

0:02:31 

4 days 8:49:58 

50000 

2000 

5600 

4172 

0:01:00 

1 day 21:10:38 

0:02:13 

5 days 2:16:56 

50000 

2000 

6400 

4768 

0:01:03 

2 days 3:46:42 

0:02:08 

5 days 19:39:40 

50000 

2000 

7200 

5364 

0:01:21 

2 days 10:36:36 

0:01:47 

6 days 13:12:59 

5000 

2000 

4800 

357 

0:00:33 

4:04:11 

0:00:26 

21:01:22 

10000 

2000 

4800 

715 

0:00:26 

7:51:06 

0:01:22 

1 day 21:24:47 

15000 

2000 

4800 

1072 

0:00:38 

11:23:22 

0:01:37 

2 days 19:42:30 

20000 

2000 

4800 

1430 

0:00:42 

15:15:01 

0:01:30 

3 days 19:27:24 

25000 

2000 

4800 

1788 

0:00:42 

18:59:04 

0:01:48 

4 days 17:24:59 

30000 

2000 

4800 

2145 

0:00:47 

22:53:25 

0:02:04 

5 days 16:30:28 

35000 

2000 

4800 

2503 

0:00:57 

1 day 2:43:48 

0:02:46 

6 days 15:10:40 

40000 

2000 

4800 

2861 

0:00:54 

1 day 6:22:51 

0:02:42 

7 days 14:58:02 

45000 

2000 

4800 

3218 

0:00:57 

1 day 10:05:17 

0:03:02 

8 days 15:11:42 

50000 

2000 

4800 

3576 

0:01:02 

1 day 14:28:30 

0:03:24 

9 days 15:51:21 

20000 

500 

4800 

357 

0:00:05 

2:18:21 

0:00:35 

20:51:20 

20000 

1000 

4800 

715 

0:00:12 

5:33:31 

0:01:40 

1 day 18:43:21 

20000 

1500 

4800 

1072 

0:00:25 

9:44:07 

0:01:08 

2 days 20:08:20 

20000 

2000 

4800 

1430 

0:00:31 

15:10:01 

0:01:29 

3 days 19:28:56 

20000 

2500 

4800 

1788 

0:01:23 

1 day 12:24:25 

0:03:30 

4 days 20:53:53 

20000 

3000 

4800 

2145 

0:01:50 

1 day 20:29:59 

0:03:44 

5 days 19:45:31 

20000 

3500 

4800 

2503 

0:02:27 

2 days 5:40:09 

0:03:56 

6 days 19:44:54 

20000 

4000 

4800 

2861 

0:03:03 

2 days 16:50:51 

0:03:46 

7 days 18:17:21 

20000 

4500 

4800 

3218 

0:04:00 

3 days 3:35:02 

0:04:28 

8 days 19:49:26 

20000 

5000 

4800 

3576 

0:04:52 

3 days 16:50:21 

0:04:44 

9 days 23:56:16 


Logistic regression with heterogeneous data 

N 

F 

Cores 

Space(GB) TWalltime 

TCompute 

CWalltime 

CCompute 

50000 

2000 

800 

596 

0:00:56 

6:14:57 

0:09:25 

3 days 19:56:38 

50000 

2000 

1600 

1192 

0:01:01 

12:28:12 

0:09:35 

7 days 19:00:17 

50000 

2000 

2400 

1788 

0:00:58 

18:43:11 

0:09:35 

11 days 13:26:10 

50000 

2000 

3200 

2384 

0:00:58 

1 day 1:09:09 

0:09:39 

15 days 10:33:19 

50000 

2000 

4000 

2980 

0:01:23 

1 day 7:34:22 

0:09:49 

19 days 6:45:31 

50000 

2000 

4800 

3576 

0:01:11 

1 day 13:51:15 

0:34:30 

77 days 5:23:50 

50000 

2000 

5600 

4172 

0:01:29 

1 day 20:20:50 

0:34:38 

90 day 19:10:12 

50000 

2000 

6400 

4768 

0:01:01 

2 days 2:55:20 

0:35:31 

103 days 19:09:22 

50000 

2000 

7200 

5364 

0:01:14 

2 days 9:38:02 

0:10:26 

34 days 20:11:28 
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Lasso with heterogeneous data 


N 

F 

Cores 

Space(GB) 

TWalltime 

TCompute 

CWalltime 

CCompute 

50000 

200 

800 

59 

0:00:12 

0:01:45 

0:00:37 

0:04:55 

50000 

200 

1600 

119 

0:00:02 

0:03:31 

0:00:47 

0:10:56 

50000 

200 

2400 

178 

0:00:02 

0:05:14 

0:01:14 

0:17:50 

50000 

200 

3200 

238 

0:00:00 

0:07:00 

0:01:22 

0:25:24 

50000 

200 

4000 

298 

0:00:04 

0:09:00 

0:01:36 

0:33:49 

50000 

200 

4800 

357 

0:00:11 

0:10:25 

0:01:57 

0:43:29 

50000 

200 

5600 

417 

0:00:10 

0:12:09 

0:02:07 

0:55:47 

50000 

200 

6400 

476 

0:00:07 

0:13:48 

0:02:19 

1:04:51 

50000 

200 

7200 

536 

0:00:09 

0:15:31 

0:02:39 

1:19:22 

50000 

1000 

800 

298 

0:00:04 

0:33:28 

0:05:20 

2:58:02 

50000 

1000 

1600 

596 

0:00:18 

1:06:33 

0:06:23 

6:00:37 

50000 

1000 

2400 

894 

0:00:25 

1:39:50 

0:08:28 

9:04:25 

50000 

1000 

3200 

1192 

0:00:09 

2:12:14 

0:08:34 

12:07:04 

50000 

1000 

4000 

1490 

0:00:08 

2:46:27 

0:09:52 

15:13:18 

50000 

1000 

4800 

1788 

0:00:21 

3:24:38 

0:13:28 

18:11:34 

50000 

1000 

5600 

2086 

0:00:10 

3:50:29 

0:14:55 

21:25:49 

50000 

1000 

6400 

2384 

0:00:06 

4:26:31 

0:16:11 

1 day 0:27:56 

50000 

1000 

7200 

2682 

0:00:11 

4:56:57 

0:17:11 

1 day 3:34:19 


SVM with homogeneous data 


N 

F 

Cores 

Space(GB) 

TWalltime 

TCompute 

CWalltime 

CCompute 

50000 

20 

48 

0 

0:00:01 

0:00:46 

0:02:45 

2:01:12 

50000 

20 

96 

0 

0:00:01 

0:01:32 

0:02:47 

4:03:05 

50000 

20 

144 

1 

0:00:02 

0:02:19 

0:02:49 

5:58:08 

50000 

20 

192 

1 

0:00:02 

0:03:06 

0:02:45 

7:56:14 

50000 

20 

240 

1 

0:00:02 

0:03:53 

0:02:51 

9:54:05 

50000 

50 

48 

0 

0:00:03 

0:01:23 

0:05:14 

3:44:06 

50000 

50 

96 

1 

0:00:03 

0:02:47 

0:05:19 

7:26:30 

50000 

50 

144 

2 

0:00:03 

0:04:11 

0:05:25 

11:07:51 

50000 

50 

192 

3 

0:00:07 

0:05:38 

0:05:25 

14:54:03 

50000 

50 

240 

4 

0:00:03 

0:07:00 

0:05:25 

18:26:10 

50000 

100 

48 

1 

0:00:05 

0:02:20 

0:09:28 

6:25:55 

50000 

100 

96 

3 

0:00:05 

0:04:40 

0:09:56 

12:49:20 

50000 

100 

144 

5 

0:00:05 

0:07:04 

0:09:45 

19:09:22 

50000 

100 

192 

7 

0:00:06 

0:09:25 

0:09:53 

1 day 1:27:47 

50000 

100 

240 

8 

0:00:05 

0:11:46 

0:10:06 

1 day 8:08:18 


Cores 

TWalltime 

TCompute 

CWalltime 

CCompute 

2500 

0:01:06 

11:35:25 

0:24:39 

31 days 19:59:13 

3000 

0:00:49 

12:10:33 

0:21:43 

32 days 2:44:11 

3500 

0:00:50 

12:17:27 

0:17:01 

30 days 7:56:19 

4000 

0:00:45 

12:38:24 

0:29:53 

40 days 13:38:19 


Star data 




